Skip to content

Conversation

@EricLBuehler
Copy link
Member

@EricLBuehler EricLBuehler commented Nov 17, 2025

  • indexed_moe_forward (fast path for ggml quants)
  • Improved usability of Context
  • Add full attn support for Metal SDPA
  • Fix bug w/ FlashAttn f16
  • Add necessary metal Device apis

@EricLBuehler EricLBuehler marked this pull request as ready for review November 17, 2025 22:53
Copy link
Member

@ivarflakstad ivarflakstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent stuff

Comment on lines +170 to +179
/// Creates a new private buffer (not necessarily zeroed).
///
/// This is intentionally not in the Metal buffer pool to allow the efficient implementation of persistent buffers.
pub fn new_private_buffer(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is nice to have, but I think we should name it something other than private buffer since that already means something for metal buffers (only available on gpu, ref).
We don't want to use actual metal private buffers as that isn't supported on iOS.

How about new_unpooled_buffer or new_persistent_buffer? :)

Copy link
Member Author

@EricLBuehler EricLBuehler Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this was a mistake on my part. The correct behavior that I intended for this function is to have:

  • private if not on iOS
  • shared/RESOURCE_OPTIONS if on iOS

Copy link
Member

@ivarflakstad ivarflakstad Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Could I ask why you want it to be private?
According to Apple's documentation there is no performance benefit, so private is usually used when you want to ensure that the cpu does not have access to the buffer for some specific reason. I'd wager a guess this kind of behaviour is frequently used in gaming.

Comment on lines +488 to +524
crate::bail!(
"The given quantized dtype {:?} is not supported for indexed_moe_forward!",
self.dtype()
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking out loud here. It would be nice to have automatic fallback to an approach that isn't as optimized, but still valid. Perhaps returning Result<Option<(CudaStorage, crate::Shape)>> is a decent starting point?
If None then fallback?

Not thinking we add this in this PR ofc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might work, the issue is that effectively indexed_moe_forward is a grouped gemm so we'd need existing infrastructure to run a grouped gemm.

Regardless, providing a grouped gemm functionality will be very useful!

@EricLBuehler
Copy link
Member Author

Addressed the review comments, the new_private_buffer method is now implemented correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants